ABSTRACT


Confidentiality in information exchange is a major concern in the digital computer era.
Audio steganography is one solution: it embeds information into digital audio by exploiting
the limitations of the human auditory system in perceiving sound waves. The proposed
steganography system applies compressive sampling (CS) to acquire and compress the bits of
a binary image. The Rivest-Shamir-Adleman (RSA) algorithm secures the binary image by
generating an encryption and decryption key pair and encrypting the data before it is
embedded. The embedding method uses statistical mean manipulation (SMM) in the low-frequency
sub-band of the wavelet domain, obtained by first dividing the audio into frequency sub-bands
with the discrete wavelet transform (DWT). With optimal parameters, the system achieves
an average signal-to-noise ratio (SNR) above 45 decibels (dB) and a capacity of 5.3833 bits
per second (bps), and it is resistant to filtering, noise, resampling, and compression attacks.
1. INTRODUCTION

Digital computers and information technology have become part of modern society's ecosystem.
Many kinds of digital information can be accessed and shared easily through information service
providers on the internet. Some people exploit this ease of exchange to intercept, interrupt, and
modify the digital data being shared. Therefore, a technique is needed that protects the security
and confidentiality of digital data, namely steganography.

Steganography is the art of hiding a message inside a medium so that other people do not notice
the message. In digital steganography, the secret message requires digital media such as image,
audio, text, or video as a vessel or host [1, 2]. Robustness, security, and hiding capacity are
the three major performance criteria of existing steganography methods [3]. Effective
steganography should have the following characteristics: perceptual transparency (i.e. the cover
and the stego object must be perceptually indistinguishable), high embedding capacity, robustness
to various types of attacks, and a high data rate of the embedded data [4].

The studies in [5, 6] state that the discrete wavelet transform (DWT) method has good
imperceptibility and robustness and is effective against the most common types of attacks designed
to destroy the secret message embedded in the audio. Furthermore, [7] states that audio processed
with compressive sampling (CS) can still be heard clearly, while [8] states that an image processed
with CS can be reconstructed to its original form after the extraction process and
the Rivest-Shamir-Adleman (RSA) decryption process. The studies in [8-11] state that RSA encryption
improves the security of wavelet domain-based steganography systems by encrypting the data before
transmission and decrypting it after reception. Furthermore, [12] states that the values of p and q
in the RSA encryption process must meet certain conditions, so we carry out several experiments to
check which p and q values are suitable for our proposed system.

In this paper, we apply CS and RSA encryption to the embedded binary image to improve security,
embedding capacity, and robustness to audio attacks. First, the binary image is embedded with
the statistical mean manipulation (SMM) method into a digital audio host that has been divided into
frequency sub-bands using DWT. The stego-audio is then attacked with nine types of audio attack:
low-pass filter (LPF), band-pass filter (BPF), noise, resampling, time scale modification (TSM),
linear speed change (LSC), pitch shifting, and two types of compression, in the Moving Picture
Experts Group Audio Layer 3 (MP3) and advanced audio coding (AAC) formats. The rest of this paper
is organized as follows. Section 2 describes the proposed method and the basic formulation of
the audio steganography method, section 3 describes the audio steganography system with its
embedding and extraction procedures, section 4 describes the performance of the audio steganography
method, and section 5 gives the conclusion.


2. PROPOSED METHOD 

The audio steganography system is designed using stereo audio in the *.wav format with a sampling
frequency of 44100 Hz as the host. This type of steganography exploits the weakness of the human
auditory system, which can only perceive sound in a frequency range of 20 Hz to 20 kHz and a level
range of -5 dB to 130 dB [13]. Audio components outside this range cannot be heard.
Embedding a secret message into digital audio changes the quality of the audio. Therefore, to
choose a good embedding method for a steganography system, we must pay attention to several
criteria, such as imperceptibility, security, capacity, speed, and robustness [13, 14].
The embedded message is a binary image, that is, a black and white image.

In general, the confidential information goes through two main processes, namely the embedding
process and the extraction process. The system model proposed in this study is illustrated in
Figure 1. First, the binary image goes through the compressive sampling acquisition process and is
then encrypted using the RSA encryption algorithm. The RSA algorithm is explained in [15]. The next
step is to embed the result into the host audio in the wavelet domain using SMM.

Figure 1. The proposed steganography system






3. RESULTS AND ANALYSIS 
3.1. Embedding system design 

A binary image is embedded into a digital audio file with the DWT-SMM method after it has gone
through the CS and RSA encryption processes. The host audio file is divided with the DWT into
a high sub-band and a low sub-band, and the low sub-band is chosen as the embedding location.
Compressive sampling then acquires the data from the binary image using bit compression, and
the output is encrypted with the RSA encryption algorithm. The encryption process needs
an encryption key, or public key, that has been generated beforehand with the RSA algorithm.

Embedding takes place once the host audio and the binary image have been processed. The SMM method
embeds the confidential information into the low-frequency sub-band of the host audio. This choice
is based on [16], which states that embedding the message in the low-frequency sub-band gives better
robustness against the image distortion caused by LPF. The SMM embedding technique is formulated
as follows [17]:
for a '1' bit value:

$$x_w(n) = x_{wd}(n) - \mu_x + \alpha w_i \tag{1}$$

for a '0' bit value:

$$x_w(n) = x_{wd}(n) - \mu_x - \alpha w_i \tag{2}$$

where $x_{wd}(n)$ is the signal in the wavelet domain, $\mu_x$ is the mean of $x_{wd}(n)$, $\alpha$
is the reliability factor in SMM that ensures the embedding reliability, $w_i$ is the embedded
message bit information, and $x_w(n)$ is the audio that has been embedded with the confidential
information. After that, the inverse DWT is performed to unite the two previously divided sub-bands
into the whole stego-audio, that is, the audio with the confidential message successfully embedded
into it. The embedding process is shown in Figure 2.
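
As a rough illustration of equations (1) and (2), the following Python sketch embeds one bit into
one audio frame. It is not the authors' implementation: the one-level Haar wavelet, the function
names, and the reading that the low sub-band mean is moved to +α for a '1' bit and to -α for
a '0' bit are assumptions made for this example. Plain Python lists are used to keep the sketch
free of dependencies.

```python
import math


def haar_dwt(x):
    """One-level Haar DWT of an even-length list: returns (low, high) sub-bands."""
    s = math.sqrt(2.0)
    low = [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]
    high = [(x[i] - x[i + 1]) / s for i in range(0, len(x), 2)]
    return low, high


def haar_idwt(low, high):
    """Inverse of haar_dwt: rebuilds the time-domain frame from both sub-bands."""
    s = math.sqrt(2.0)
    x = []
    for a, d in zip(low, high):
        x.append((a + d) / s)
        x.append((a - d) / s)
    return x


def smm_embed_frame(frame, bit, alpha):
    """Embed one bit by shifting the mean of the low sub-band (assumed reading of SMM)."""
    low, high = haar_dwt(frame)
    mu = sum(low) / len(low)               # mean of the low sub-band
    shift = alpha if bit == 1 else -alpha  # +alpha for '1', -alpha for '0'
    low_w = [v - mu + shift for v in low]  # remove the mean, then add the shift
    return haar_idwt(low_w, high)          # back to the time domain


# Hypothetical host frame: 64 samples of a slow sine wave.
frame = [0.3 * math.sin(0.05 * n) for n in range(64)]
stego = smm_embed_frame(frame, bit=1, alpha=0.0015)
```

After embedding, the low sub-band of the stego frame has a mean of +α or -α, which is the statistic
an extractor can inspect. A larger α makes that statistic easier to detect after an attack but
changes the audio more.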


Figure 2. The embedding process flowchart


3.2. Extraction system design

The binary image is extracted from the stego-audio file with the DWT-SMM method, then decrypted
with the RSA decryption algorithm and reconstructed with CS to reveal the message information.
As in the embedding process, the stego-audio first goes through the sub-band frequency dividing
process with DWT. The low-frequency sub-band is processed because that is where the binary image
was embedded. After the binary image data has been extracted, it is decrypted using the RSA
decryption algorithm with a decryption key, or private key, that has been generated beforehand with
the RSA algorithm. The next process is to reconstruct the bit data of the binary image using CS, so
that the binary image can be revealed again. The model design process is shown in Figure 3.
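
The first extraction step can be sketched in Python as follows. The sketch assumes a one-level Haar
wavelet and a simple decision rule, the sign of the low sub-band mean of each frame. The paper does
not spell out its decision rule, and the role of its threshold parameter (thr) is not modeled here.
The function names are hypothetical.

```python
import math


def haar_low_band(x):
    """Low (approximation) sub-band of a one-level Haar DWT of an even-length list."""
    s = math.sqrt(2.0)
    return [(x[i] + x[i + 1]) / s for i in range(0, len(x), 2)]


def smm_extract_bits(stego, frame_len):
    """Read one bit per frame from the sign of the low sub-band mean (assumed rule)."""
    bits = []
    for start in range(0, len(stego) - frame_len + 1, frame_len):
        low = haar_low_band(stego[start:start + frame_len])
        mean = sum(low) / len(low)
        bits.append(1 if mean > 0 else 0)
    return bits


# Toy stego signal: three frames whose low sub-band means are positive, negative, positive.
stego = [0.001] * 8 + [-0.001] * 8 + [0.002] * 8
print(smm_extract_bits(stego, frame_len=8))
```

The extracted bits would then be passed to RSA decryption and CS reconstruction, in that order.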


3.3. Pre-processing and CS acquisition process 

This stage is done before the message data is embedded into the host audio. The message data that
goes through the acquisition process is a binary image with matrix size a × b. In brief, the CS
acquisition process compresses the matrix of binary image data by a predetermined compression
ratio. It thus hides the intended data in messages while minimizing its size, which lets us
transfer the data with less overall burden on capacity [18]. CS can be formulated as follows [19]:

$$y = Ax + z \tag{3}$$

where $y$ is the measurement vector, $A$ is an M × N matrix, commonly referred to as the sensing
matrix, that provides the information about $x$, and $z$ is a stochastic or deterministic error
term with bounded energy. The model design process is shown in Figure 4.

Figure 3. The extraction process

Figure 4. Pre-processing and CS acquisition diagram


The stages of the process are described as follows:

- Read the binary image file into a two-dimensional matrix M(m, n).

- Reshape the matrix so that it has 8 columns, adjusting the number of rows accordingly, as
the input to the CS process. When entering the CS process, the column length of the matrix
becomes 256. The output is M(m', n').

- Convert the reshaped matrix into decimal values. The output is M(m'', n'').

- Generate impulses with a matrix column length of 256. In the CS concept, the input matrix must be
sparse, meaning that the value 0 appears often.

- Perform scalar multiplication on the matrix to obtain the output M-(n), namely the binary image
data that has gone through the acquisition process.

- The output data M-(n) is in binary form.
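
The stages above can be sketched in Python as follows. This is only one possible reading of
the procedure: grouping the bits into rows of 8, mapping each decimal value to a length-256 impulse,
and measuring it with a random Gaussian sensing matrix are assumptions, as are the function names
and the number of measurements m. The paper's bit compression ratio and its final conversion back
to binary are not modeled.

```python
import random


def bits_to_decimals(bits):
    """Group a flat bit list into rows of 8 columns and convert each row to 0..255."""
    assert len(bits) % 8 == 0
    return [int("".join(str(b) for b in bits[i:i + 8]), 2)
            for i in range(0, len(bits), 8)]


def impulse(value, length=256):
    """Sparse vector of the given length with a single 1 at position `value`."""
    x = [0] * length
    x[value] = 1
    return x


def sensing_matrix(m, n=256, seed=0):
    """Hypothetical m x n Gaussian sensing matrix A with a fixed seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]


def acquire(A, x):
    """Noise-free CS measurement y = A x."""
    return [sum(a * v for a, v in zip(row, x)) for row in A]


pixels = [0, 0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1]  # 16 image bits
values = bits_to_decimals(pixels)                            # two decimal values
A = sensing_matrix(m=32)
measurements = [acquire(A, impulse(v)) for v in values]      # 32 numbers per value
```

Each 256-element impulse is represented by only m measurements, which is where the reduction in
data size comes from in this reading.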


3.4. Post-processing and CS reconstruction process 

This stage is done after the message data has been extracted from the stego-audio and decrypted.
The message data that goes through the reconstruction process is a binary image with matrix size
a × b. In brief, the CS reconstruction process restores the matrix of binary image data from
the previous process to its original size. A compressible signal can be recovered from a small set
of measurements. This is a key element of CS: how the sensing process relates to the sparse
representation determines whether or not a signal can be recovered from the measurements [20].
The model design process is shown in Figure 5.
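
To illustrate why a sparse signal can be recovered from few measurements, the sketch below recovers
a single-impulse vector from noise-free measurements by correlating them with every column of
the sensing matrix, which is one step of matching pursuit. This is a stand-in technique chosen for
the example and is not necessarily the reconstruction algorithm used in the paper. The Gaussian
sensing matrix, the number of measurements (64), and the function names are assumptions.

```python
import random


def sensing_matrix(m, n=256, seed=0):
    """Hypothetical m x n Gaussian sensing matrix with a fixed seed."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, 1.0) for _ in range(n)] for _ in range(m)]


def recover_impulse_index(A, y):
    """Index of the column of A most correlated with y (one matching-pursuit step)."""
    n = len(A[0])
    scores = [abs(sum(A[i][j] * y[i] for i in range(len(A)))) for j in range(n)]
    return scores.index(max(scores))


def decimal_to_bits(value):
    """Convert a value in 0..255 back to a list of 8 bits."""
    return [int(c) for c in format(value, "08b")]


A = sensing_matrix(m=64)
true_value = 173
y = [row[true_value] for row in A]       # measurements of an impulse at index 173
recovered = recover_impulse_index(A, y)  # should point back at the impulse position
bits = decimal_to_bits(recovered)        # 8 image bits for this value
```

With measurement noise (the z term in the CS formulation) or with too few measurements,
the correlation peak becomes less distinct and recovery can fail.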


3.5. RSA encryption process 

This stage is done after the CS process and before the data is embedded into the host audio.
The purpose of the encryption process is to secure the binary image using a public encryption key.
Figure 6 illustrates that, before entering the encryption process, the image data in binary form
must be converted to decimal. After encryption, the data is converted back into binary form so that
it can be passed to the next stage, which is the embedding process.
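
The binary-to-decimal, encrypt, decimal-to-binary flow can be sketched in Python as follows.
The key values e = 7 and n = 77 correspond to the p = 7, q = 11 pair reported in section 4.2.
The block sizes are assumptions, since the paper does not state them: 6-bit plaintext blocks keep
every value below n = 77, and 7-bit ciphertext blocks can hold any value below 77. The function
names are hypothetical.

```python
def bits_to_blocks(bits, block_bits):
    """Binary to decimal: group bits into fixed-size blocks and read each as an integer."""
    assert len(bits) % block_bits == 0
    return [int("".join(map(str, bits[i:i + block_bits])), 2)
            for i in range(0, len(bits), block_bits)]


def blocks_to_bits(values, width):
    """Decimal to binary: write each integer as `width` bits."""
    out = []
    for v in values:
        out.extend(int(c) for c in format(v, "0{}b".format(width)))
    return out


def rsa_encrypt_bits(bits, e, n, block_bits=6, cipher_bits=7):
    """Textbook RSA on small blocks: c = m^e mod n for each block value m."""
    # Every plaintext value must be below n, and every ciphertext must fit in cipher_bits.
    assert (1 << block_bits) <= n <= (1 << cipher_bits)
    plain = bits_to_blocks(bits, block_bits)
    cipher = [pow(m, e, n) for m in plain]
    return blocks_to_bits(cipher, cipher_bits)


message_bits = [0, 0, 0, 0, 1, 0, 1, 1, 1, 1, 1, 1]    # two 6-bit blocks: 2 and 63
cipher_bits = rsa_encrypt_bits(message_bits, e=7, n=77)  # two 7-bit blocks
```

Note that the ciphertext is longer than the plaintext under these block sizes, which would reduce
the effective payload. A modulus this small is for illustration only and offers no real security.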

Figure 5. Post-processing and CS reconstruction flowchart


3.6. Image decryption process

This stage is done after the data has been extracted from the stego-audio. The purpose of
the decryption process is to decrypt the extracted binary image data using the private decryption
key. The decryption process follows essentially the same procedure as the encryption process, as
shown in Figure 7.







[Figure 7 blocks: private keys; binary to decimal; RSA decryption algorithm; decimal to binary; binary image (CS)]


Figure 7. Flowchart of RSA decryption process 
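
The decryption flow in Figure 7 can be sketched in Python in the same way. The private key d = 43
with n = 77 is the pair reported in section 4.2. The block sizes (7-bit ciphertext blocks, 6-bit
plaintext blocks) and the function names are assumptions made for this example.

```python
def bits_to_blocks(bits, block_bits):
    """Binary to decimal: group bits into fixed-size blocks and read each as an integer."""
    assert len(bits) % block_bits == 0
    return [int("".join(map(str, bits[i:i + block_bits])), 2)
            for i in range(0, len(bits), block_bits)]


def blocks_to_bits(values, width):
    """Decimal to binary: write each integer as `width` bits."""
    out = []
    for v in values:
        out.extend(int(c) for c in format(v, "0{}b".format(width)))
    return out


def rsa_decrypt_bits(bits, d, n, cipher_bits=7, block_bits=6):
    """Textbook RSA on small blocks: m = c^d mod n for each ciphertext value c."""
    cipher = bits_to_blocks(bits, cipher_bits)
    plain = [pow(c, d, n) for c in cipher]
    # If an attack flipped ciphertext bits, a value here may no longer fit in block_bits.
    return blocks_to_bits(plain, block_bits)


cipher_bits_in = [0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 0, 1, 1]  # ciphertext values 51 and 35
plain_bits = rsa_decrypt_bits(cipher_bits_in, d=43, n=77)      # back to 6-bit blocks
```

Because each ciphertext block is decrypted as a whole, a single bit error introduced by an audio
attack can corrupt every bit of the corresponding plaintext block.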


4. RESULT AND ANALYSIS 

System testing is done using two schemes. The first scheme uses non-optimized parameters that focus
on good imperceptibility of the stego-audio and large capacity. The second scheme uses optimized
parameters that focus on good imperceptibility of the stego-audio and good robustness against
various types of attacks on the stego-audio. The proposed steganography system is evaluated based
on the bit error rate (BER), capacity (C), objective difference grade (ODG), peak signal-to-noise
ratio (PSNR), and structural similarity index measure (SSIM). BER is the proportion of bits that
differ between the extracted message and the message before insertion, relative to the total
number of inserted bits [21]. Capacity (C), or payload, refers to the number of bits embedded into
the audio per unit time, measured in bits per second (bps) [22]. ODG measures the objective quality
of the modified audio signal. It ranges from -4 to 0, where an ODG of 0 means that the degradation
is imperceptible [23]. PSNR is the ratio between the maximum value allowed by the bit depth of
the image and the amount of noise that affects the signal, and it is usually measured in dB.
The greater the PSNR value, the better the image quality, that is, the closer it is to the original
image. SSIM is an index that measures the degree of similarity between two images, here
the processed image and the original image. SSIM compares distortion in terms of luminance,
contrast, and structure. An SSIM value of 0 means there is no similarity between the two images,
while an SSIM value of 1 means that the two images are identical [24].

The test involves five different types of audio: voice, piano, guitar, drum, and orchestra.
The embedded message is a binary image of 64 × 64 pixels that is encrypted using a pair of keys
generated with the RSA algorithm. Figure 8 shows the binary image that is inserted into the audio.
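
Two of these metrics, BER and PSNR, can be sketched in Python as follows for binary images stored
as flat lists of 0 and 1. The peak value of 1 and the function names are assumptions made for this
example. SSIM is left out because it needs windowed image statistics.

```python
import math


def bit_error_rate(original, extracted):
    """Fraction of positions where the extracted bits differ from the original bits."""
    errors = sum(1 for a, b in zip(original, extracted) if a != b)
    return errors / len(original)


def psnr_binary(original, extracted, peak=1.0):
    """PSNR in dB; infinite when the two images are identical (MSE = 0)."""
    mse = sum((a - b) ** 2 for a, b in zip(original, extracted)) / len(original)
    return math.inf if mse == 0 else 10.0 * math.log10(peak ** 2 / mse)


original = [0, 1, 1, 0]
damaged = [1, 1, 1, 0]                       # one wrong pixel out of four
print(bit_error_rate(original, damaged))     # 0.25
print(psnr_binary(original, original))       # inf
```

For pixels that are 0 or 1, the mean squared error equals the BER, so under these assumptions
a BER near 0.5 corresponds to a PSNR near 3 dB, while a BER of 0 gives an infinite PSNR.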


4.1. Compressive sampling performance 

CS performance on binary images is tested at a bit compression ratio of 0.02 while the side
parameter is varied over 8, 16, 32, and 64. A bit compression ratio of 0.02 is selected because it
produces an image compression ratio of 62.52%, compared with image compression ratios of 75% and
100% for bit compression ratios of 0.025 and 0.03. The smaller the image compression percentage,
the fewer the constituent bits, which reduces the computation time. Figure 9 compares
the compressed images for different side values.

Based on Figure 9, the smaller the side value, the worse the quality of the compressed image.
Side 64 produces compressed images with good quality. The quality loss is caused by the bit
compression process, which reduces the number of bits in the inserted binary image.


Figure 8. Information message

Figure 9. CS compression images comparison


4.2. RSA algorithm performance 

This stage selects the generated key pair according to how successfully it encrypts and decrypts
the binary image. There are 210 pairs of prime numbers generated, with no pair containing the same
prime number twice. The generated prime numbers are in the range of 0 to 50. Based on
the simulation, only half of these pairs allow the steganography system to encrypt and decrypt
successfully. The RSA algorithm has limited functionality when combined with the SMM method, which
handles the embedding and extraction process. The values p = 7 and q = 11 are chosen to generate
the encryption and decryption key pair because they produce a bit error rate (BER) ≈ 0 and succeed
in the encryption and decryption process. The computation time of the encryption and decryption
process is shown in Table 1.

With the selected values of p and q, the generated encryption key and decryption key are 7 and 43.
Based on these values, the key can be categorized as a key with 8 bits. According to [25], a 32-bit
key can be cracked in 35.8 minutes, while an 8-bit key takes only 1.2 µs. It can be concluded that
the more bits in the key, the more difficult it is to break. However, the encryption and decryption
process then takes longer.
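
The reported key values can be reproduced with textbook RSA key generation, as in the Python sketch
below. The function names are hypothetical, and the public exponent e = 7 is passed in explicitly
because the paper does not say how it was chosen. The sketch also counts the ordered pairs of
distinct primes below 50, which gives 210. That this is how the paper enumerated its pairs is
an inference from the matching count, not something the paper states.

```python
from itertools import permutations
from math import gcd


def primes_below(limit):
    """All primes smaller than `limit`, by trial division."""
    return [k for k in range(2, limit)
            if all(k % d for d in range(2, int(k ** 0.5) + 1))]


def rsa_keys(p, q, e):
    """Return the public key (e, n) and private key (d, n) for primes p and q."""
    n = p * q
    phi = (p - 1) * (q - 1)
    assert gcd(e, phi) == 1   # e must be invertible modulo phi
    d = pow(e, -1, phi)       # modular inverse (Python 3.8+)
    return (e, n), (d, n)


public, private = rsa_keys(7, 11, e=7)
print(public, private)                         # (7, 77) (43, 77)
pairs = list(permutations(primes_below(50), 2))
print(len(pairs))                              # 210
```

Here n = 77 and φ(n) = 60, and 7 × 43 = 301 = 5 × 60 + 1, so 43 is the inverse of 7 modulo 60.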


Table 1. The computational time of RSA encryption and decryption


Process       Computational time (s)
Encryption    0.001387
Decryption    0.020651


4.3. Non-optimized parameters performance 

In this stage, we test the scheme using uniform non-optimized parameters: the decomposition level
(n), the frame length (Nframe), and the threshold (thr) are set to 1, 1024, and 0.9, respectively.
The reliability factor (α) takes a different value for each audio type: 0.0015, 0.0055, 0.0035,
0.003, and 0.0035 for voice, piano, guitar, drum, and orchestra, respectively. Two things are
analyzed, namely the influence of CS on computation time and the audio robustness against the nine
types of attacks mentioned earlier. Table 2 compares the computation time with and without CS.

From Table 2, using CS speeds up computation by about 30%, which means that the embedding and
extraction process is about 30% faster than without CS. Figure 10 shows the average BER obtained
from the nine types of audio attacks using non-optimized parameters. Based on Figure 10,
the average BER obtained for each type of audio attack is poor because it reaches about 0.5.
Expressed as a percentage, the error in the bits that make up the binary image reaches 50%.


Table 2. Computing time comparison


Scheme        Embedding time (s)    Extraction time (s)
With CS       1.319                 0.801
Without CS    1.902                 1.117372
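
The speed-up of about 30% can be checked against the values in Table 2 with a few lines of Python.
The function name is hypothetical, and the percentage is computed as the relative reduction in
time, which is an assumption about how the figure was derived.

```python
def reduction_percent(without_cs, with_cs):
    """Relative time saved by using CS, in percent of the time without CS."""
    return 100.0 * (without_cs - with_cs) / without_cs


embedding = reduction_percent(1.902, 1.319)       # about 30.7 %
extraction = reduction_percent(1.117372, 0.801)   # about 28.3 %
total = reduction_percent(1.902 + 1.117372, 1.319 + 0.801)  # about 29.8 %
print(round(embedding, 1), round(extraction, 1), round(total, 1))
```

Under this reading, the embedding, extraction, and combined times are each reduced by roughly 30%.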

Figure 10. BER result by using non-optimized parameters


4.4. Optimized parameters scheme performance 

The optimized parameters are obtained by evaluating the highest BER values under the non-optimized
parameters. As a result, the decomposition level (n), the frame length (Nframe), and the threshold
(thr) are set to 2, 2048, and 0.9, respectively. In this scheme, the reliability factor (α) also
changes: 0.00038, 0.00085, 0.00085, 0.00015, and 0.00085 for voice, piano, guitar, drum, and
orchestra, respectively. Two things are analyzed, namely the change in capacity after using
the optimized parameters and the audio robustness against the nine types of attacks mentioned
earlier. Figure 11 compares the capacity obtained with non-optimized and optimized parameters.


[Figure 11 chart: capacity (bps) for the optimized (with CS) and non-optimized (with CS) schemes; labeled value 21.5332]


Figure 11. Capacity comparison result between two schemes 


DWT-SMM-based audio steganography with RSA encryption and... (Fikri Adhanadi) 


1102 O ISSN: 1693-6930 


Based on Figure 11, the capacity decreases. This is because capacity must be sacrificed to obtain
good imperceptibility and relatively good robustness against audio attacks. Figure 12 shows
the average BER obtained from the nine types of audio attacks using optimized and non-optimized
parameters. For five of the nine types of attacks the average BER decreased, while for the noise
attack there was a slight increase. For the TSM, LSC, and pitch shifting attacks, it did not change
at all.
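
The two reported capacities, 21.5332 bps and 5.3833 bps, are consistent with a simple relation in
which one bit is embedded per frame of Nframe sub-band samples and each decomposition level halves
the sample rate. The paper does not state this formula, so the Python sketch below is an inference
from the reported numbers, and the function name is hypothetical.

```python
def capacity_bps(fs, n_frame, level):
    """Bits per second if one bit is embedded per frame of n_frame sub-band samples."""
    return fs / (n_frame * 2 ** level)


non_optimized = capacity_bps(44100, n_frame=1024, level=1)
optimized = capacity_bps(44100, n_frame=2048, level=2)
print(round(non_optimized, 4), round(optimized, 4))   # 21.5332 5.3833
```

Under this relation, doubling both the frame length and the number of sub-band halvings divides
the capacity by four, which matches the drop between the two schemes.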

Furthermore, we analyze the optimal parameters for each audio type based on the signal-to-noise
ratio (SNR) and ODG. The SNR obtained for each type of audio is above 40 decibels (dB), with ODG
values ranging from about -1 to -2.4. This indicates that the stego-audio has little distortion and
is close to the original audio file. Table 3 shows the SNR and ODG results using optimal
parameters.


[Figure 12 chart: average BER per type of attack, including AAC compression and MP3 compression, for the optimized and non-optimized schemes; one MP3 compression bar is labeled 0.49]

Figure 12. Comparison of average BER by type of attack


Table 3. SNR and ODG results with optimal parameters


Host audio    Voice      Piano      Guitar     Drum       Orchestra
SNR (dB)      45.4526    49.6711    44.2453    58.9779    45.9193
ODG           -1.5108    -2.4067    -2.0748    -1.1216    -2.0543
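
For reference, the usual definition of SNR between a host signal and its stego version can be
sketched in Python as follows. The function name and the toy signals are assumptions made for this
example. The last lines average the five SNR values from Table 3.

```python
import math


def snr_db(host, stego):
    """SNR in dB: host signal power over the power of the embedding distortion."""
    signal = sum(h * h for h in host)
    noise = sum((h - s) ** 2 for h, s in zip(host, stego))
    return math.inf if noise == 0 else 10.0 * math.log10(signal / noise)


host = [1.0, -1.0, 1.0, -1.0]
stego = [1.01, -1.0, 1.0, -1.0]       # one sample changed by 0.01
print(round(snr_db(host, stego), 2))  # about 46.02 dB

table3_snr = [45.4526, 49.6711, 44.2453, 58.9779, 45.9193]
print(round(sum(table3_snr) / len(table3_snr), 2))   # about 48.85 dB
```

The average of the Table 3 values is about 48.85 dB, which is above 45 dB even though the guitar
value alone is slightly below it.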


4.5. System performance results 

This section reports the results obtained with the optimal parameters when the stego-audio is
attacked and the binary image is then extracted. When BER = 0, the PSNR is infinite (∞) and
the SSIM is 1, which means that the binary image has been extracted and reconstructed successfully
and is identical to the input image. On the other hand, attacks that shift the pitch or modify
the time scale or speed of the audio signal produce a poor BER, which in turn affects the PSNR and
SSIM values. The results are shown in Table 4.


Table 4. System performance results with optimal parameters


Type of attack             Parameter    BER      PSNR     SSIM
LPF                        12k          0        ∞        1
                           15k          0        ∞        1
Noise                      40 dB        0        ∞        1
                           50 dB        0        ∞        1
MP3                        128k         0        ∞        1
                           192k         0        ∞        1
                           256k         0        ∞        1
Resampling                 24k          0        ∞        1
Time Scale Modification    1%           0.482    3.192    0.036
                           2%           0.5      2.968    -0.032
Linear Speed Change        1%           0.493    3.018    -0.019
                           5%           0.481    3.194    0.019
Pitch Shifting             1%           0.494    3.018    -0.022
                           2%           0.492    2.261    -0.033


5. CONCLUSION

In this paper, RSA encryption and compressive sampling for DWT-SMM-based audio steganography have
been proposed. Using the RSA algorithm to increase the security of the binary image is successful,
but the choice of key is limited when RSA is combined with SMM as the embedding and extraction
method. Applying compressive sampling to the binary image yields a steganography system with
a capacity of 5.3833 bps under optimal parameters and with about 30% faster computation during
the embedding and extraction process. With optimal parameters, the proposed method obtains
an average SNR above 45 dB and ODG values of roughly -1 to -2.4, giving a good level of
imperceptibility, with the stego-audio quality resembling the original audio. With optimal
parameters, the method also produces excellent PSNR and SSIM values on the extracted images, with
a BER of 0, when the stego-audio is attacked with a low-pass filter in the range of 12000 to
15000 Hz, noise at 40 dB and above, resampling to a sample rate of 24000 Hz, and MP3 compression at
bit rates of 128 kilobits per second and above. However, the proposed steganography system has
extremely low robustness against linear speed change, time scale modification, and pitch shifting
attacks, with an average BER of 0.49.